fix: filter non-productive TCRs, dynamic gene family columns, single-… #82

Open: KevinMLanderos wants to merge 1 commit into main from fix_details
Conversation

@KevinMLanderos
Collaborator

Three bugs were fixed:

  1. bin/sample_calc.py — Gene family CSVs (v_family, d_family, j_family) were built with hard-coded maximum indices (TRBV: 30, TRBD: 2, TRBJ: 2), silently dropping any gene with a number above those limits. The max index is now derived dynamically from genes observed in each sample. Samples with no valid calls for a gene type write a sample-only row instead of crashing.

  2. modules/local/annotate/main.nf — ANNOTATE_PROCESS did not filter for productive TCRs, so non-productive rearrangements propagated into every downstream file: concatenated_cdr3_sorted.tsv, OLGA pgen calculation, TCRSHARING, and the patient workflow (GIANA, GLIPH2, overlap metrics). The process now reads the 'productive' column when present and retains only productive entries before writing per-sample _cdr3.tsv files, fixing all downstream analyses in one place.

  3. bin/pseudobulk.py — Cell Ranger AIRR output was not filtered for cell or contig quality before pseudobulking. is_cell, high_confidence, and productive filters are now applied in both pseudobulk() and pseudobulk_phenotype() when those columns are present, ensuring background barcodes, low-confidence assemblies, and non-productive contigs are excluded from single-cell input.
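
For reviewers, here is a minimal sketch of the two patterns described above: filtering on quality columns only when they are present, and deriving gene family columns from the highest index observed in a sample. It is not the code from this PR. The helper names (`keep_when_present`, `family_counts`), the set of values treated as true, and the dict-per-row input are all assumptions made for illustration; the actual scripts may use a different data structure and output layout.

```python
import re

# Assumption: boolean columns arrive as strings such as "T"/"F" or "true"/"false".
TRUE_VALUES = {"t", "true", "1"}


def keep_when_present(rows, columns=("is_cell", "high_confidence", "productive")):
    """Keep rows that pass every quality column they actually carry.

    A column that is absent from a row is skipped rather than treated as a
    failure, mirroring the "when those columns are present" behaviour.
    """
    kept = []
    for row in rows:
        present = [c for c in columns if c in row]
        if all(str(row[c]).strip().lower() in TRUE_VALUES for c in present):
            kept.append(row)
    return kept


def family_counts(calls, prefix):
    """Count calls per gene family, with columns running from 1 to the
    highest family number observed (no hard-coded maximum).

    Returns an empty dict when no call matches, so the caller can write a
    sample-only row instead of failing.
    """
    pattern = re.compile(rf"{re.escape(prefix)}(\d+)")
    numbers = []
    for call in calls:
        match = pattern.match(call or "")
        if match:
            numbers.append(int(match.group(1)))
    if not numbers:
        return {}
    return {f"{prefix}{i}": numbers.count(i) for i in range(1, max(numbers) + 1)}


# Hypothetical rows; the third has no is_cell column at all.
rows = [
    {"productive": "T", "is_cell": "T", "v_call": "TRBV7-2*01"},
    {"productive": "F", "is_cell": "T", "v_call": "TRBV30*01"},
    {"productive": "T", "v_call": "TRBV31*01"},
]
kept = keep_when_present(rows)
v_counts = family_counts([r["v_call"] for r in kept], "TRBV")
```

In this sketch the non-productive row is dropped, and the family with index 31 gets its own column because the column range follows the data rather than a fixed limit of 30.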

…cell quality filters

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

Unit Test Results

10 tests: 2 ✅ passed, 0 💤 skipped, 8 ❌ failed (2 suites, 1 file, 21s ⏱️)

For more details on these failures, see this check.

Results for commit 389a328.
